ABSTRACT



A computer processor includes a replay system to replay 
instructions which have not executed properly and a first 
event pipeline coupled to the replay system to process 
instructions including any replayed instructions. A second 
event pipeline is provided to perform additional processing 
on an instruction. The second event pipeline has an ability to 
detect one or more faults occurring therein. The processor 
also includes a synchronization circuit coupled between the 
first event pipeline and the second event pipeline to syn- 
chronize faults occurring in the second event pipeline to 
matching instruction entries in the first event pipeline. 

18 Claims, 9 Drawing Sheets

TECHNIQUE FOR SYNCHRONIZING FAULTS IN A PROCESSOR HAVING A
REPLAY SYSTEM

FIELD

The present invention is directed to a computer processor.
More particularly, the present invention is directed to a
technique for synchronizing faults in a processor having a
replay system.

BACKGROUND

The primary function of most computer processors is to
execute computer instructions. Most processors execute
instructions in the programmed order that they are received.
However, some recent processors, such as the Pentium® II
processor from Intel Corp., are "out-of-order" processors.
An out-of-order processor can execute instructions in any
order as the data and execution units required for each
instruction becomes available. Moreover, the execution of
instructions can generate a wide variety of exceptions or
faults. A fault handler or exception handler is typically called
to handle the fault or exception. Detailed fault information
should be stored to allow the fault handler to process or
handle the fault. Maintaining the proper fault information
can be a complex task, especially for an out of order
processor. The prior art has not adequately addressed this
problem, particularly where multiple program flows are
active.

Therefore, there is a need for a technique to maintain the
proper fault information to allow a fault handler to process
the fault.

SUMMARY

According to an embodiment, a processor is provided
including a replay system to replay instructions which have
not executed properly and a first event pipeline coupled to
the replay system to process instructions including any
replayed instructions. A second event pipeline is provided to
perform additional processing on an instruction. The second
event pipeline has an ability to detect one or more faults
occurring therein. The processor also includes a synchronization
circuit coupled between the first event pipeline and
the second event pipeline to synchronize faults occurring in
the second event pipeline to matching instruction entries in
the first event pipeline.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and a better understanding of the present
invention will become apparent from the following detailed
description of exemplary embodiments and the claims when
read in connection with the accompanying drawings, all
forming a part of the disclosure of this invention. While the
foregoing and following written and illustrated disclosure
focuses on disclosing example embodiments of the
invention it should be clearly understood that the same is by
way of illustration and example only and is not limited
thereto. The spirit and scope of the present invention being
limited only by the terms of the appended claims.

The following represents brief descriptions of the
drawings, wherein:

FIG. 1 is a block diagram illustrating a computer system
that includes a processor according to an embodiment of the
present invention.

FIG. 2 is a diagram illustrating a re-order buffer according
to an example embodiment of the present invention.

FIG. 3 is a block diagram illustrating a memory system
according to an example embodiment of the present invention.

FIG. 4 is a block diagram illustrating a fault detection and
recording circuit according to an example embodiment of
the present invention.

FIG. 5 is a block diagram illustrating an age guarding
logic according to an example embodiment of the present
invention.

FIG. 6 is a block diagram illustrating a fault detection and
recording circuit according to another example embodiment
of the present invention.

FIG. 7 is a block diagram illustrating an operation of the
comparison of the sequence numbers between the entries in
the asynchronous event pipeline and the synchronous event
pipeline.

FIG. 8 is a block diagram illustrating a portion of a fault
detection and recording circuit according to an example
embodiment of the present invention.

FIG. 9 is a block diagram illustrating a portion of a fault
detection and recording circuit according to another example
embodiment of the present invention.

DETAILED DESCRIPTION

I. Introduction

According to an embodiment of the invention, a processor
is provided that includes a replay system and a memory
system. According to an embodiment, instructions may be
speculatively scheduled for execution before the source data
is available. The replay system includes a checker for
determining when an instruction has executed properly or
improperly (e.g., executed before the source data was
available).

The memory system may include one or more synchronous
event pipelines, and one or more asynchronous event
pipelines. The synchronous event pipelines may include, for
example, a store event pipeline and a load event pipeline.
The synchronous event pipelines include stages or a series
of steps performed to process or execute an instruction. A
number of faults or exceptions can occur during the
execution of an instruction. The faults generated and
identified within the synchronous event pipeline are
referred to as synchronous faults.

In some circumstances, additional processing (e.g., longer
latency steps) may be required to process or execute an
instruction. An example is where there is a local cache miss
(e.g., for a load or store instruction) and the data must be
retrieved from an external memory device (which is a long
latency operation). The stages or steps of such additional
processing (e.g., to retrieve the data from an external device)
are called asynchronous event pipelines. This additional
processing (of the asynchronous event pipeline) is typically
external to the synchronous event pipelines. While the
processing occurring in the synchronous event pipeline may
complete in a predetermined or known number of stages or
steps, processing in an asynchronous event pipeline is
less predictable and may be random. The completion or
timing of the processing in the asynchronous pipeline is not
synchronized with the processing in the synchronous event
pipeline, and hence the term "asynchronous." There may be
a significant and unknown delay before the processor can
determine if a fault occurred during this additional long
latency processing in the asynchronous event pipeline. Any
faults occurring in the asynchronous event pipeline are
generally referred to as asynchronous faults.

If the oldest instruction in the processor has a fault, a fault
handler will be called to process or handle the fault. Detailed 
or specific fault information (e.g., a specific fault code and 
a linear address which caused the fault) for the oldest 
instruction may be maintained in a fault register to allow the 
fault handler to identify and handle the fault. According to 
one embodiment of the invention, age guarding logic is 
provided for receiving instruction entries (including specific 
fault information and a sequence number) in parallel from 
several synchronous event pipelines and from several asyn- 
chronous event pipelines. The age guarding logic, which 
may include one or more comparators, maintains fault 
identification information in a fault register for the oldest 
fault (i.e., for the oldest instruction having a fault). 
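The age guarding behavior described above (keeping fault information only for the oldest faulting instruction) can be illustrated with a short software sketch. This is not the circuit of the embodiment. It is a minimal model under stated assumptions: the names `FaultEntry` and `update_fault_register` are hypothetical, a smaller sequence number is assumed to mean an older instruction, a fault code of zero is assumed to mean "no fault", and sequence number wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FaultEntry:
    """One instruction entry arriving from an event pipeline (hypothetical layout)."""
    sequence_number: int   # program order; smaller is assumed to be older
    fault_code: int        # specific fault code; 0 is assumed to mean no fault
    linear_address: int    # address associated with the instruction


def update_fault_register(current: Optional[FaultEntry],
                          entries: Iterable[FaultEntry]) -> Optional[FaultEntry]:
    """Return the oldest faulting entry among the register contents and new entries."""
    oldest = current
    for entry in entries:
        if entry.fault_code == 0:
            continue  # this instruction did not fault, so it cannot replace the register
        # Comparator step: keep whichever faulting entry is older.
        if oldest is None or entry.sequence_number < oldest.sequence_number:
            oldest = entry
    return oldest
```

Each call models one cycle in which several pipelines present entries in parallel. The returned value plays the role of the fault register contents for the next cycle.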

According to another embodiment of the invention, the 
asynchronous faults are synchronized (or combined or 
merged) with the instruction entries in the synchronous 
event pipelines when the faulting instruction is replayed. A 
synchronization circuit is provided including a buffer and a 
comparator, for example. According to an embodiment, the 
synchronization circuit compares a sequence number of an 
instruction in the asynchronous event pipeline having an 
asynchronous fault to the sequence numbers of the instruc- 
tion entries in the synchronous pipeline. If a match in 
sequence numbers is found, fault identification information 
for the instruction having an asynchronous fault is tagged or 
written to the matching instruction entry in a synchronous 
event pipeline. This synchronization (or combining or 
merging) of asynchronous faults to instruction entries in the 
synchronous pipeline reduces the number of inputs to the 
age guarding logic. The reduced number of inputs to the age 
guarding logic can allow the age guarding logic to have a 
reduced complexity and a decrease in latency. 
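The sequence number matching described above can be sketched in software as follows. This is only an illustration under stated assumptions, not the synchronization circuit itself: the names `PipelineEntry`, `AsyncFault` and `synchronize_fault` are hypothetical, and the synchronous pipeline is modeled as a simple list of entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PipelineEntry:
    """An instruction entry in a synchronous event pipeline (hypothetical layout)."""
    sequence_number: int
    fault_code: int = 0        # 0 is assumed to mean no fault recorded
    linear_address: int = 0


@dataclass(frozen=True)
class AsyncFault:
    """A buffered fault reported by an asynchronous event pipeline."""
    sequence_number: int
    fault_code: int
    linear_address: int


def synchronize_fault(buffered: AsyncFault,
                      sync_entries: List[PipelineEntry]) -> Optional[PipelineEntry]:
    """Tag the matching synchronous entry with the buffered fault information.

    Returns the tagged entry, or None when no sequence number matches
    (for example, because the instruction has not yet been replayed).
    """
    for entry in sync_entries:
        if entry.sequence_number == buffered.sequence_number:  # comparator match
            entry.fault_code = buffered.fault_code
            entry.linear_address = buffered.linear_address
            return entry
    return None
```

After tagging, the asynchronous fault travels with the synchronous entry, so a downstream consumer such as the age guarding logic only needs inputs from the synchronous pipelines.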

II. Overall System Architecture 

Referring to the Figures in which like numerals indicate 
like elements, FIG. 1 is a block diagram illustrating a 
computer system that includes a processor according to an 
embodiment of the present invention. The processor 100 
includes a Front End 112, which may include several units, 
such as an instruction decoder for decoding instructions 
(e.g., for decoding complex instructions into one or more 
micro-operations or uops), a Register Alias Table (RAT) for 
mapping logical registers to physical registers for source 
operands and the destination, and an instruction queue (IQ) 
for temporarily storing instructions. In one embodiment, the 
instructions stored in the instruction queue are micro- 
operations or uops, but other types of instructions can be 
used. The Front End 112 may include different or even 
additional units. According to an embodiment of the 
invention, each instruction includes two logical sources and 
one logical destination, for example. The sources and des- 
tination are logical registers within the processor 100. The 
RAT within the Front End 112 may map logical sources and 
destinations to physical sources and destinations, respec- 
tively. 

Front End 112 is coupled to a scheduler 114. Scheduler 
114 dispatches instructions received from the processor 
Front End 112 (e.g., from the instruction queue of the Front 
End 112) when the resources are available to execute the 
instructions. Normally, scheduler 114 sends out a continuous 
stream of instructions. However, scheduler 114 is able to 
detect, by itself or by receiving a signal, when an instruction 
should not be dispatched. When scheduler 114 detects this, 
it does not dispatch an instruction in the next clock cycle. 
When an instruction is not dispatched, a "hole" is formed in 
the instruction stream from the scheduler 114, and another
device can insert an instruction in the hole. The instructions
are dispatched from scheduler 114 speculatively. Therefore,
scheduler 114 can dispatch an instruction without first
determining whether data needed by the instruction is valid
or available.

Scheduler 114 outputs the instructions to a dispatch
multiplexer (mux) 116. The output of mux 116 includes two
parallel paths, including an execution path (beginning at line
137) that is connected to a memory system 119 (including
memory execution unit 118 and local caches) for execution,
and a replay path (beginning at line 139) that is connected
to a replay system 117. The execution path will be briefly
described first, while the replay path will be described below
in connection with a description of the replay system 117.

A. Memory System

As shown in FIG. 1, processor 100 includes a memory
system 119. The memory system 119 includes a memory
execution unit 118, a translation lookaside buffer (TLB) 121,
an L0 cache system 120, an L1 cache system 122 and an
external bus interface 124. Execution unit 118 is a memory
execution unit that is responsible for performing memory
loads (loading data read from memory or cache into a
register) and stores (data writes from a register to memory
or cache). Instructions may be executed out of order.

Execution unit 118 is coupled to multiple levels of
memory devices that store data. First, execution unit 118 is
directly coupled to L0 cache system 120, which may also be
referred to as a data cache. As described herein, the term
"cache system" includes all cache related components,
including cache memory, and cache TAG memory and
hit/miss logic that determines whether requested data is
found in the cache memory. L0 cache system 120 is the
fastest memory device coupled to execution unit 118. In one
embodiment, L0 cache system 120 is located on the same
semiconductor die as execution unit 118.

If data requested by execution unit 118 is not found in L0
cache system 120, execution unit 118 will attempt to retrieve
the data from additional levels of memory. After the L0
cache system 120, the next level of memory devices is L1
cache system 122. Accessing L1 cache system 122 is typically
4-16 times as slow as accessing L0 cache system 120.
In one embodiment, L1 cache system 122 is located on the
same processor chip as execution unit 118.

If the data is not found in L1 cache system 122, execution
unit 118 is forced to retrieve the data from the next level
memory device, which is an external memory device
coupled to an external bus 102. Example external memory
devices connected to external bus 102 include an L2 cache
system 106, main memory 104 and disk memory 105. An
external bus interface 124 is coupled between execution unit
118 and external bus 102. Execution unit 118 may access
any external memory devices connected to external bus 102
via external bus interface 124. The next level of memory
device after L1 cache system 122 is an L2 cache system 106.
Access to L2 cache system 106 is typically 4-16 times as
slow as access to L1 cache system 122, for example.
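Because each of the two steps down the hierarchy (L0 to L1, then L1 to L2) is described as typically 4-16 times slower than the level above it, the slowdowns compound. The short sketch below only restates that arithmetic. The function name `compounded_slowdown` is hypothetical, and the 4-16 range is the typical figure quoted above, not a measured value.

```python
from typing import Tuple

# Typical per-level slowdown range quoted in the text (lower bound, upper bound).
RATIO_RANGE: Tuple[int, int] = (4, 16)


def compounded_slowdown(levels: int,
                        ratio_range: Tuple[int, int] = RATIO_RANGE) -> Tuple[int, int]:
    """Return the (lowest, highest) slowdown after stepping down `levels` cache levels."""
    low, high = ratio_range
    return low ** levels, high ** levels


# One level down from L0 reaches L1; two levels down reaches L2.
l1_vs_l0 = compounded_slowdown(1)
l2_vs_l0 = compounded_slowdown(2)
```

Under these figures, an L2 access would typically be between 16 and 256 times as slow as an L0 access, which is why a miss that leaves the processor chip is treated as a long latency operation.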

After L2 cache system 106, the next level of memory
device is main memory 104, which typically comprises
dynamic random access memory ("DRAM"), and then disk
memory 105 (e.g., a magnetic hard disk drive). Access to
main memory 104 and disk memory 105 is substantially
slower than access to L2 cache system 106. In one embodiment
(not shown), the computer system includes one external
bus dedicated to L2 cache system 106, and another
external bus used by all other external memory devices.

A copy of each instruction (or micro-op) output from Front
End 112 is also stored in a re-order buffer (ROB) 152 (also
known as an instruction pool) in program order. The ROB
152 also stores several fields or other information for each
instruction to keep track of the status of each instruction. If
an instruction has been properly executed (instructions may
be executed out of order), the instruction is then retired (or
committed to architectural state) in program order. After an
instruction has been retired, it is then deleted from ROB 152.

FIG. 2 is a diagram illustrating a ROB according to an
example embodiment. The ROB 152 is a buffer that includes
several fields for each instruction or micro-op (UOP),
including: a sequence number 72 which identifies the program
order of the instruction, a copy of the instruction 74 or
UOP itself, an executed field 76 indicating whether or not
the instruction has been executed, the thread 78 for the
instruction (processor 100 is a multi-threaded machine in
this embodiment), a path field 80 (because there can be
multiple paths for each thread), a retired field 82 indicating
that the instruction has executed properly and is ready to be
retired (instructions must be retired in program order) and an
exception or fault vector 84 that identifies a general error
exception or fault caused by the execution of the instruction.
(As used herein, the terms "exception" and "fault" mean
generally the same thing, or may be used interchangeably).
The exception vector (or code) 84 is normally clear (e.g., all
zeros) unless the processor detects a fault/exception during
execution of the instruction. As shown in FIG. 2, the first
instruction in the ROB 152 has a sequence number N, has
been executed (indicated by the 1 in the executed field 76),
belongs to thread 0, is not ready for retirement (indicated by
the 0 in the RET field 82), and caused a page fault (as
indicated by the 14 stored in the fault vector 84 for this
instruction), for example.
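The ROB fields listed above, together with the rule that instructions retire only in program order, can be modeled with a small sketch. This is an illustration under stated assumptions rather than the ROB 152 hardware: the names `RobEntry` and `retire_in_order` are hypothetical, and the buffer is modeled as a queue whose head is the oldest instruction.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class RobEntry:
    """One ROB entry, mirroring the fields described for FIG. 2."""
    sequence_number: int           # field 72: program order
    uop: str                       # field 74: copy of the instruction or UOP
    executed: bool = False         # field 76
    thread: int = 0                # field 78
    path: int = 0                  # field 80
    ready_to_retire: bool = False  # field 82 (RET)
    fault_vector: int = 0          # field 84: 0 means no fault recorded


def retire_in_order(rob: Deque[RobEntry]) -> List[RobEntry]:
    """Remove and return entries from the head of the ROB while they are ready.

    Retirement stops at the first entry that is not ready, so a younger
    instruction never retires ahead of an older one.
    """
    retired: List[RobEntry] = []
    while rob and rob[0].ready_to_retire:
        retired.append(rob.popleft())
    return retired
```

For example, if entries 1 and 3 are ready but entry 2 is not, only entry 1 retires. Entry 3 waits behind entry 2 even though it finished executing.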

According to an embodiment of the invention, the path
field 80 identifies those instructions which correspond to a
particular path or program flow. As noted above, processor
100 is a speculative processor in which instructions may be
executed out of order. When a branch instruction is encountered
in the instruction stream, branch prediction is used to
predict whether or not the branch will be taken. Based on the
prediction, instructions are fetched and executed along a
predicted (or speculative) path. If it turns out that the branch
was mispredicted, instructions are fetched from the correct
path. To distinguish between the bad instructions of the
mispredicted path and the new instructions from the correct
path, a new path or program flow is assigned to the new
instructions (i.e., the correct path) whenever a mispredicted
branch is encountered.

According to an embodiment of the invention, there can
be multiple sources of faults or exceptions occurring every
processor cycle. These possible exceptions or faults can be
generated from different sources in the processor. In many
cases, the execution of an instruction may generate a fault or
exception. As understood by those skilled in the art, processor
100 can include a variety of execution units, including
the memory execution unit 118, an arithmetic logic unit
(ALU), etc. For example, an Arithmetic Logic Unit may
perform a division or other mathematical operation on data
and generate a divide error or a floating point error. The
memory execution unit 118 can generate a variety of
memory related errors, including an alignment check error
(e.g., where the data does not align with a cache line), a page
fault, a hardware or bus error (indicated generally as a
Machine Check error), etc. The Machine Check error could
include a variety of hardware type errors that may be
detected by memory execution unit 118, including a bus
error, a parity error, an error correcting code (ECC) error,
etc.

According to an embodiment of the invention, the processor
100 is used with virtual memory in which a linear
address space is divided into pages that can be mapped into
physical memory (e.g., main memory 104 and cache) and/or
disk storage (e.g., disk memory 105). When a program
references or addresses a logical address in memory, the
processor translates the logical address into a linear address
and then uses its paging mechanism to translate the linear
address into a corresponding physical address. The processor
uses page tables and page directories stored in memory
to translate the address. The most recently accessed page
directory and page table entries are cached in the translation
lookaside buffer (TLB) 121. If the page containing the linear
address is not currently in physical memory (i.e., a TLB
miss), the processor must calculate the physical address
based on the linear address (i.e., linear address to physical
address translation), also known as a page walk. If a specified
page table entry is not present in main memory, the
address translation cannot be performed and the processor
generates a page fault (specifically a page not present page
fault). In response to this page fault, a fault handler in the
operating system will load the page table entry from disk
memory 105 into main memory and cache systems, and then
restart (or re-execute) the instruction that generated the page
fault.
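The translation sequence described above (TLB lookup, page walk on a miss, page fault when no entry is present) can be sketched as follows. This is a simplified model under stated assumptions, not the paging mechanism of processor 100: the page size of 4096 bytes, the single-level dictionary used as the page table, and the names `PageFault` and `translate` are all hypothetical, and a TLB miss is taken here to mean that the translation is not cached in the TLB.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when no page table entry is present for the requested page."""

    def __init__(self, linear_address: int) -> None:
        super().__init__(f"page not present for linear address {linear_address:#x}")
        self.linear_address = linear_address  # kept so a handler can identify the fault


def translate(linear_address: int,
              tlb: Dict[int, int],
              page_table: Dict[int, int]) -> int:
    """Translate a linear address to a physical address.

    `tlb` and `page_table` both map a page number to a physical frame number.
    """
    page, offset = divmod(linear_address, PAGE_SIZE)
    frame = tlb.get(page)
    if frame is None:
        # TLB miss: fall back to the (much slower) page walk.
        frame = page_table.get(page)
        if frame is None:
            raise PageFault(linear_address)
        tlb[page] = frame  # cache the translation for later accesses
    return frame * PAGE_SIZE + offset
```

In this model the fault handler's job corresponds to adding the missing entry to `page_table` and then calling `translate` again, which mirrors restarting the instruction that faulted.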

B. Replay System 

Referring to FIG. 1 again, processor 100 further includes 
a replay system 117. Replay system 117, like execution unit 
118, receives instructions output by dispatch multiplexer 
116. Execution unit 118 receives instructions from mux 116 
over line 137, while replay system 117 receives instructions 
over line 139. As noted above, according to an embodiment 
of the invention, some instructions can be speculatively 
scheduled for execution before the correct source data for 
them is available (e.g., with the expectation that the data will 
be available in many instances after scheduling and before 
execution). Therefore, it is possible, for example, that the 
correct source data was not yet available at execution time, 
causing the instruction to execute improperly. In such case, 
the instruction will need to be re-executed (or replayed) with
the correct data. Replay system 117 detects those instruc- 
tions that were not executed properly when they were 
initially dispatched by scheduler 114 and routes them back 
again to the execution unit (e.g., back to the memory 
execution unit 118) for replay or re-execution. 

Replay system 117 includes two staging sections. One 
staging section includes a plurality of staging queues A, B, 
C and D, while a second staging section is provided as 
staging queues E and F. Staging queues delay instructions
for a fixed number of clock cycles. In one embodiment, 
staging queues A-F each comprise one or more latches. The 
number of stages can vary based on the amount of staging 
or delay desired in each execution channel. Therefore, a 
copy of each dispatched instruction is staged through staging 
queues A-D in parallel to being staged through execution 
unit 118. In this manner, a copy of the instruction is 
maintained in the staging queues A-D and is provided to a 
checker 150, described below. This copy of the instruction 
may then be routed back to mux 116 for re-execution or 
"replay" if the instruction did not execute properly. 
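A staging queue that delays each instruction by a fixed number of clock cycles behaves like a shift register. The sketch below models that behavior in software. It is an illustration only: the class name `StagingQueue` and its `clock` method are hypothetical, instructions are represented as strings, and an empty slot (a "hole") is represented as `None`.

```python
from collections import deque
from typing import Deque, Optional


class StagingQueue:
    """Fixed-delay line: whatever enters leaves exactly `stages` clocks later."""

    def __init__(self, stages: int) -> None:
        if stages < 1:
            raise ValueError("a staging queue needs at least one stage")
        # Start with every stage empty; maxlen keeps the length fixed.
        self._slots: Deque[Optional[str]] = deque([None] * stages, maxlen=stages)

    def clock(self, incoming: Optional[str]) -> Optional[str]:
        """Advance one clock cycle and return the instruction leaving the queue."""
        outgoing = self._slots[0]
        self._slots.append(incoming)  # appending drops the oldest slot because of maxlen
        return outgoing
```

With four stages, an instruction presented on one clock appears at the output four clocks later. That fixed delay is what lets the staged copy reach the checker in step with the instruction passing through the execution unit.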

Replay system 117 further includes a checker 150. 
Generally, checker 150 receives instructions output from 
staging queue D and then determines which instructions 
have executed properly and which have not. If the instruc- 
tion has executed properly, the checker 150 declares the 
instruction "replay safe" and the instruction (or a control 
signal identifying the instruction) is forwarded to ROB 152
where instructions are retired in program order. Retiring
instructions is beneficial to processor 100 because it frees up
processor resources, thus allowing additional instructions to
begin execution.

An instruction may execute improperly for many reasons.
Two common reasons are a source (or data register) dependency
and an external replay condition. A source dependency
can occur when a source of a current instruction is
dependent on the result of another instruction. This data
dependency can cause the current instruction to execute
improperly if the correct data for the source is not available
at execution time. Source dependencies are related to the
registers.

A scoreboard 140 is coupled to the checker 150. Scoreboard
140 tracks the readiness of sources. Scoreboard 140
keeps track of whether the source data was valid or correct
prior to instruction execution. After the instruction has been
executed, checker 150 can read or query the scoreboard 140
to determine whether data sources were not correct. If the
sources were not correct at execution time, this indicates that
the instruction did not execute properly (due to a register
data dependency), and the instruction should therefore be
replayed.

Examples of an external replay condition may include a
local cache miss (e.g., source data was not found in L0 cache
system 120 at execution time), incorrect forwarding of data
(e.g., from a store buffer to a load), hidden memory
dependencies, a write back conflict, an unknown data/
address, and serializing instructions. The L0 cache system
120 generates an L0 cache miss signal as an external replay
signal 145 to checker 150 if there was a cache miss to L0
cache system 120 (which indicates that the source data for
the instruction was not found in L0 cache system 120).
Other signals can similarly be generated to checker 150 to
indicate the occurrence of other external replay conditions.
In this manner, checker 150 can determine whether each
instruction has executed properly or not. If the checker 150
determines that the instruction has not executed properly, the
instruction will then be returned to multiplexer 116 via
replay loop 156 to be replayed (i.e., to be re-executed at
execution unit 118). Instructions routed via the replay loop
156 are coupled to mux 116 via line 161.

In addition, instructions which did not execute properly
can alternatively be routed to a replay queue 170 for
temporary storage before being output to mux 116 for replay.
The replay queue 170 can be used to store certain long
latency instructions (for example) and their dependent
instructions until the long latency instruction is ready for
execution. One example of a long latency instruction is a
load instruction where the data must be retrieved from
external memory (i.e., where there is an L1 cache miss).
When the data returns from external memory (e.g., from L2
cache system 106 or main memory 104), the instructions in
the replay queue 170 are then unloaded from replay queue
170 to mux 116 and output by mux 116 for replay. Memory
system 119 may signal the checker 150, for example, when
the requested data for an instruction stored in replay queue
170 has returned from external memory.

In conjunction with sending a replayed instruction (either
along replay loop 156 or output from replay queue 170) to
mux 116, checker 150 sends a "stop scheduler" signal 151
to scheduler 114. In such case, checker 150 (or other circuit)
also sends a control signal to mux 116 identifying whether
the line 161 (from replay loop 156) or line 171 (from replay
queue 170) should be selected. According to an embodiment
of the invention, stop scheduler signal 151 is sent to scheduler
114 in advance of the replayed instruction reaching the
mux 116. In one embodiment, stop scheduler signal 151
instructs the scheduler 114 not to schedule an instruction on
the next clock cycle. This creates an open slot or "hole" in
the instruction stream output from mux 116 in which a
replayed instruction can be inserted.

III. An Example Embodiment of a Memory System

FIG. 3 is a block diagram illustrating a memory system
according to an example embodiment. According to an
embodiment, the memory system 119 includes a memory
execution unit 118, a TLB 121, an L0 cache system 120, an L1
cache system 122 and an external bus interface 124. As
noted above, memory execution unit 118 performs memory
load and store operations. A group or sequence of steps (e.g.,
functions or events) that is performed for each load operation
may be referred to as the load event pipeline. Likewise,
a group or series of steps (functions or events) that is also
performed for each store operation may be called the store
event pipeline. Both of these pipelines are synchronous
event pipelines. According to an embodiment, synchronous
event pipelines refer to the events or stages which all
instructions (of that type) must pass through.

During each load or store operation a number of exceptions
or faults can be generated. As used herein, faults (or
exceptions) can be categorized as either "synchronous" or
"asynchronous." Synchronous faults are those faults which
are generated and identified immediately within the execution
pipeline or synchronous event pipeline (e.g., faults
having a low latency). Examples of synchronous faults (or
low latency faults) include a segmentation fault, an alignment
fault, and stack fault because each of these exceptions
or faults can be identified within the load or store event
pipeline. The series of events or functions performed to
process or execute an instruction may be referred to as a
synchronous event pipeline.

In some cases, however, additional steps or functions
must be performed, which may be referred to as asynchronous
event pipelines. These additional processing steps
usually involve longer latency operations, and are referred to
as asynchronous event pipelines. Thus, an asynchronous
fault may be considered to be a longer latency fault which
cannot be identified within the load or store event pipeline,
but requires a longer period of time. Asynchronous faults
can occur during one or more long latency operations or
procedures performed in an asynchronous pipeline. One
example where an asynchronous (or long latency) fault can
occur is where there is an L0 cache system miss and an L1
cache system miss, requiring an external bus request (a long
latency operation). An asynchronous or long latency fault
occurs if a bus error, parity error, ECC error or other
hardware error occurs based on the data retrieved from
external memory 101. The process of retrieving data from an
external memory device 101 may take many processor clock
cycles to complete, and this itself creates a long latency.
Thus, the processor cannot detect possible errors (e.g., bus
error, parity or ECC error) in this external data retrieval until
the data has been retrieved from the external memory device and
checked. Another example of an asynchronous or long
latency fault is where a page fault occurs due to a page not
present (page table entry not in main memory), as described
above. These are only two examples of long latency or
asynchronous faults, and others may be possible. The series
of steps or functions required to perform such additional
long latency processing may be referred to as an asynchronous
event pipeline.

The terms "synchronous" and "asynchronous" will now
be additionally explained or clarified, according to an
embodiment. According to an embodiment, a synchronous
event pipeline refers to the steps or events which all (or at
least many) instructions go through. An asynchronous event
pipeline refers to additional processing (usually long latency
in nature) that may be performed on some instructions.
Synchronous faults refer to faults that can be detected in a
synchronous event pipeline. Asynchronous faults are those
faults that are generated and detected during additional
processing in the asynchronous pipeline.

The additional processing steps or events in the asynchronous
pipeline are not synchronous to the synchronous event
pipeline and the replay queue. As an example, assume there
are 10 stages in the synchronous event pipeline (e.g., for a load)
and 10 stages in the (parallel) replay system. A normal
instruction will complete execution after 10 stages. If there
was any reason for replay, such as data dependency, and
assuming that the data dependency was resolved in one
replay latency, the instruction will complete its execution in
30 stages (10 for initial execution, 10 to pass through the
replay system and 10 to re-execute). Thus, in general, an
instruction in the synchronous event pipeline will complete
execution in:

10 stages + (number of replays) * 20 stages.

An instruction that requires additional processing (i.e.,
processing in an asynchronous event pipeline) will complete
in a less predictable number of stages (i.e., complete after an
unknown delay or latency). Thus, it can be seen that the term
"asynchronous" can be used to describe the additional
processing because the instruction that is processed in the
asynchronous event

[...]

pipeline 2 (e.g., an event pipeline for stores). During instruction
execution or processing in these synchronous pipelines,
the FDRC 131 of the memory execution unit 118 reports a
general fault code or vector to ROB 152 (which identifies the
general type of fault) if a fault occurred. The general type of
fault (e.g., page fault, vector 14) is then stored in ROB 152.
This general fault code does not usually provide sufficient
fault information to allow the fault handler to process the
fault. Thus, more detailed fault information (including a
detailed fault code more specifically identifying the type of
fault) will be stored in a fault register.

Each synchronous event pipeline receives as an input an
instruction (or uop) ID information and a linear address as
the inputs for a series of instructions (e.g., load instructions).
The instruction ID information can include an instruction or
instruction ID (which identifies the instruction or uop or
opcode), a program flow or thread ID (identifying the
program flow for the instruction or uop) and a sequence
number. As the instruction is processed along each stage of
the event pipeline, a specific fault code (identifying the
specific type of fault) is generated as well if a fault or
exception occurred during execution. All of this information
(instruction ID information, linear address or operand(s) and
a specific fault code for each instruction) is input to the AGL
410 for each synchronous pipeline. If no fault occurs, then
the specific fault code is clear (e.g., all zeros), for example.

A plurality of asynchronous event pipelines are also
shown in FIG. 4, including an asynchronous event pipeline
1 (e.g., representing the functions or events performed by the

pipeline does not complete in a predictable time or manner 30 machine check circuit 137) and an asynchronous pipeline 2 

like in the synchronous event pipeline, which typically (e.g., representing the functions or events performed by the 

completes execution in 10+20* (number of replays) stages, TLBMH 133). These pipelines may be considered asynchro- 

as an example. The latency caused from the execution in the nous because they each involve performing long latency 

additional processing or asynchronous pipeline may be a operations (such as an address translation or an external bus 

random number of stages. 35 request), and thus, can result in asynchronous faults. The 

Referring again to FIG. 3, the memory execution unit outputs of these asynchronous pipelines may include an 

includes a fault detection and recording circuit (FDRC) 131 instruction ID information, a linear address or operand(s) 

for detecting and recording faults/exceptions. An external and a specific fault code for each instruction (if a fault 

memory controller 135 issues and tracks bus requests over occurred). Age guarding logic(AGL) 410 receives instruc- 

the external bus 102 via external bus interface 124 (e.g., 40 tion identification information, operands (or a linear 

when there is a cache miss on L0 cache and LI cache). A address) and a specific fault code (if a fault occurred) from 

machine check circuit 137 detects hardware errors (e.g., each asynchronous pipeline as well, 

parity errors, ECC errors and the like) and bus errors and Thus, it can be seen that the AGL 410 receives inputs from 

reports these errors to the FDRC 131. A TLB miss handler two synchronous event pipelines and two asynchronous 

(TLBMH) 133 performs the linear address to physical 45 event pipelines, as an example (but any number of inputs 

address translation in the event there is a TLB miss, and could be received). The function of AGL 410 is to maintain 

reports any errors or faults in that process to the FDRC 131. fault identification information (e.g., specific fault code, 

One example of a fault that could be detected by the address and sequence number) for the oldest fault (e.g., 

TLBMH 133 is when a page table entry is missing from smallest sequence number) in each fault register on a per 

main memory (during linear to physical address translation), 50 program flow (or per thread) basis. For example, fault 

causing a type of page fault (i.e., page not present). The steps register FR1 is assigned to thread 1, FR2 is assigned to 

or functions performed by the TLBMH 133 may be consid- thread 3, . . . and FRP is assigned to thread P. Thus, the 

ered one asynchronous event pipeline. While, the steps or function of AGL 410 is to maintain the specific fault code for 

functions performed by the machine check circuit 137 may the oldest fault in thread 1 in FR1, to maintain the specific 

be considered to be another asynchronous event pipeline. 55 fault code for the oldest fault in thread 2 in FR2, etc. 

A. An Example Fault Detection and Recording Circuit According to an embodiment of the invention, the AGL 410 

1. Processing Parallel Fault Sources Without Fault Syn- compares the sequence numbers for the four inputs where a 

chronization fault is indicated (e.g., a nonzero specific fault code) and 

FIG. 4 is a block diagram illustrating a fault detection and selects the oldest fault on a per thread (or per program flow) 

recording circuit according to an example embodiment. The 60 basis. The inputs are simply discarded where no fault is 

fault detection and recording circuit (FDRC) 131 includes indicated. The oldest fault for each thread or program flow 

buffers 412 and 414, age guarding logic 410, and one or is then compared to the corresponding fault register to 

more fault (or exception) registers 416 (e.g., one fault determine if the newly received fault is older than the fault 

register per program flow or thread). already recorded in the fault register. As noted above, 

A plurality of synchronous event pipelines are shown in 65 instructions can be executed out of order in processor 100, 

FIG. 4, including, for example, a synchronous event pipeline and thus, an older fault (e.g., smaller sequence number) may 

1 (e.g., event pipeline for loads) and a synchronous event replace a newer fault already present in the fault register. 
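
The age guarding behavior described for AGL 410 can be illustrated with a short software model. This is only a sketch under stated assumptions, not the circuit itself: the names FaultEntry, age_guard and fault_regs are hypothetical, a fault code of zero is taken to mean "no fault" (as in the all-zeros example above), and sequence number wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FaultEntry:
    """One input to the age guarding logic (hypothetical field names)."""
    thread_id: int       # program flow or thread ID
    seq: int             # sequence number; smaller means older
    fault_code: int      # specific fault code; 0 means no fault
    linear_address: int


def age_guard(inputs: Iterable[FaultEntry],
              fault_regs: Dict[int, FaultEntry]) -> None:
    """Keep the oldest fault per thread in fault_regs (one register per thread).

    Assumption: sequence numbers never wrap, so a plain '<' orders ages.
    """
    for entry in inputs:
        if entry.fault_code == 0:
            continue  # inputs without a fault are simply discarded
        recorded = fault_regs.get(entry.thread_id)
        # An older fault replaces a newer one already in the register.
        if recorded is None or entry.seq < recorded.seq:
            fault_regs[entry.thread_id] = entry


# Four inputs, as in the two synchronous plus two asynchronous example.
fault_regs: Dict[int, FaultEntry] = {}
age_guard(
    [
        FaultEntry(thread_id=1, seq=40, fault_code=14, linear_address=0x1000),
        FaultEntry(thread_id=1, seq=25, fault_code=14, linear_address=0x2000),
        FaultEntry(thread_id=2, seq=7, fault_code=0, linear_address=0x0),
        FaultEntry(thread_id=2, seq=9, fault_code=3, linear_address=0x3000),
    ],
    fault_regs,
)
```

In this model the register for thread 1 ends up holding the entry with sequence number 25, and the register for thread 2 holds the entry with sequence number 9, because the thread 2 input with sequence number 7 carried no fault.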
The AGL 410 can be implemented, for example, using an arrangement of
comparators, typically involving two or more stages. FIG. 5 is a block
diagram illustrating an age guarding logic according to an example
embodiment of the present invention. As shown in FIG. 5, the AGL 410
includes comparators 505, 510, 515 and 520, with each comparator
selecting and outputting the oldest fault value (i.e., selecting and
outputting the fault having the smaller sequence number). Comparator 505
compares inputs 1 and 2, while comparator 510 compares inputs 3 and 4.
Comparator 515 compares the outputs of comparators 505 and 510.
Comparator 520 compares the output of comparator 515 and the sequence
number of the fault recorded in the fault register FR. If the fault
selected by comparator 515 is older (e.g., smaller sequence number) than
the fault presently recorded or stored in the fault register, then the
fault recorded in the fault register is replaced with the older fault.
Otherwise, the fault recorded in the fault register is left unchanged.
This can be performed for each of the fault registers (one fault
register per thread or program flow).

The AGL circuit illustrated in FIG. 5 is provided merely as an example.
There are a wide variety of ways in which this circuit can be
implemented. For example, all four inputs can be compared in a single
clock cycle using one level of comparators. In general, this would
decrease the processing latency at the expense of increasing cycle time
(reducing the frequency), and would typically require more area (or
silicon) to achieve because more aggressive circuit designs for
comparators are needed.

2. Synchronizing Asynchronous Faults

FIG. 6 is a block diagram illustrating a fault detection and recording
circuit according to another example embodiment. The fault detection and
recording circuit includes buffers 602 and 605, an age guarding logic
(AGL) 610, an asynchronous fault buffer 615 (with AGL) and a plurality
of fault registers FR1, FR2, . . . FRP. One function of the fault
detection and recording circuit of FIG. 6 is to maintain fault
identification information (e.g., a specific fault code, the linear
address and sequence number) for the oldest fault per thread (or per
program flow) in the corresponding fault register. However, while the
circuit of FIGS. 4-5 receives and processes all fault sources (event
pipelines) in parallel, the circuit of FIG. 6 synchronizes the
asynchronous faults with the instruction entries in the synchronous
pipeline. This can result in an AGL circuit that is faster and requires
less area as compared to some implementations of the parallel circuit of
FIGS. 4-5. Also, less instruction information must be carried in the
asynchronous pipeline for the circuit of FIG. 6.

The ability to synchronize (e.g., merge or combine) the asynchronous
faults with the instruction entries in the synchronous pipeline is made
possible by the replay system 117 in one embodiment. In the absence of
the replay system 117, the processor 100 might otherwise stall while
waiting for a long latency fault to be processed (e.g., for a page not
present type of page fault, a processor could stall while waiting for
the new page table entry to be loaded into memory and then to complete
the linear to physical address translation).

However, according to an embodiment of the invention, instructions are
speculatively scheduled for execution before the source data is
available. As a result, if a load or store instruction is executed
improperly because the data was not available at execution time (e.g.,
because there is a local cache miss requiring an external bus request to
retrieve the data), this will cause the checker 150 to replay or
re-execute the load or store instruction until it executes properly.
Thus, each replayed instruction will again reappear in a synchronous
event pipeline to be processed or re-executed.

In addition, the instruction may also be processed by one or more
specialized units (e.g., TLBMH 133 or machine check circuit 137), the
steps or operation of each also known generally as an asynchronous event
pipeline, to perform a relatively long latency process or operation
(such as an address translation) required to complete execution of the
instruction. Two example asynchronous event pipelines are shown in FIG.
6 (asynchronous event pipeline 1 and asynchronous event pipeline 2). An
asynchronous fault (e.g., a page fault) may have been generated for this
instruction during instruction processing or execution in the
asynchronous event pipeline. Because the instruction is being replayed,
the instruction will pass through the synchronous event pipeline each
time it is replayed (or re-executed).

According to an embodiment, a specific fault code for an instruction
from the asynchronous event pipeline having a fault can be written or
tagged to the matching or corresponding instruction entry in the
synchronous pipeline when the matching instruction entry arrives in the
synchronous event pipeline. This, in essence, allows asynchronous faults
to be synchronized or to be transformed into synchronous faults. In this
manner, asynchronous faults and synchronous faults can be merged or
combined. This reduces the number of fault sources input into the age
guarding logic (AGL) 610, thereby reducing circuit complexity and
latency.

Referring to FIG. 6, an asynchronous event pipeline 1 processes an
instruction and generates an asynchronous fault (e.g., a page fault).
The asynchronous event pipeline 1 outputs instruction entries, which may
be simplified instruction entries. The simplified instruction entry
could include a sequence number, a thread ID or program flow ID and an
identification of the specific fault (which may be the specific fault
code or some subset of this code, for example). This instruction entry
from the asynchronous event pipeline is stored in a buffer 602 (there
may be many buffers 602). A comparison is then performed between the
sequence number of the instruction entry in buffer 602 and the sequence
numbers of the instruction entries in each of the synchronous event
pipelines.

FIG. 7 is a block diagram illustrating an operation of the comparison of
the sequence numbers between the entries in the asynchronous event
pipeline and the synchronous event pipeline according to an example
embodiment. As shown in FIG. 7, a comparator 705 compares the sequence
number of the asynchronous pipeline entry stored in buffer 602 to the
sequence number of each of the instruction entries in one or more
synchronous pipelines (as the instruction entries enter or pass through
the synchronous pipeline). If a match is found, the specific fault code
(or information identifying the specific asynchronous fault) is written
or tagged to the matching instruction entry in the synchronous event
pipeline. This synchronizes the asynchronous faults with the synchronous
pipeline, in effect, transforming the asynchronous fault into a
synchronous fault. Therefore, buffer 602 and comparator 705 may be
considered as part of a synchronization circuit 630 to synchronize
asynchronous faults with the instruction entries in the synchronous
pipelines.

It should be noted that only the simplified instruction entry (e.g.,
sequence number, specific fault identification and thread ID) is
required in the asynchronous event pipelines because the instruction
entries in the synchronous pipelines provide the remaining instruction
information for each instruction (e.g., uop or instruction ID, linear
address). This simplifies the asynchronous event pipeline.

The use of a simple input buffer (e.g., buffer 602) to store
instruction entries from the asynchronous pipeline works




fine where the asynchronous pipeline generates faults which are in
program order (e.g., such as the machine check circuit 137). In such a
case, the older fault will typically be written or tagged back to the
matching instruction entry in the synchronous event pipeline before a
newer fault will be received at buffer 602 from the asynchronous event
pipeline.

A problem may arise, however, if the instruction entries from the
asynchronous event pipeline can be generated out of order. An example is
the TLB miss handler (TLBMH) 133 (FIG. 3) which can speculatively
perform a linear to physical address translation for different out of
order instructions, and faults can occur on each translation. As an
illustration, an instruction entry is stored in buffer 602 and then an
older instruction entry (e.g., having a smaller sequence number) is
output to buffer 602 from an asynchronous event pipeline. This may cause
the older instruction entry in the buffer 602 to be overwritten (and
lost) by the newer entry, which is a problem because a function of the
fault detection and recording circuit 131 is to maintain the oldest
faults (e.g., the specific fault code and linear address for the fault)
in the fault registers.

According to an example embodiment, one solution to this problem is to
use an asynchronous fault buffer 615 with age guarding logic (AGL) 620,
shown at the lower portion of FIG. 6. The asynchronous fault buffer 615
includes an age guarding logic (AGL) 620, a plurality of temporary
registers (TR1, TR2, TR3, . . . TRP), and a multiplexer (mux) 625. There
is preferably one temporary register (TR) per thread or program flow
(i.e., same number of temporary registers TR as the number of fault
registers, FR).

As shown in FIG. 6, the instruction entries (which may preferably
include only the simplified instruction entries) from the asynchronous
event pipeline 2 are output to buffer 605. AGL 620 operates similarly to
(or the same as) the AGL 410 (FIG. 4). The AGL 620 performs an age
guarding function for the asynchronous fault buffer 615 to maintain the
oldest instruction entry in each of the temporary registers (TR), on a
per thread or per program flow basis. If the instruction entry in buffer
605 has a fault and is older than the instruction entry stored in the
corresponding temporary register (of the same thread or program flow),
the AGL 620 replaces the entry in the temporary register (TR) with the
older instruction entry having a fault. Mux 625 then outputs one or more
of the instruction entries in the temporary registers (TR) for comparing
the sequence number of the TR entry to the sequence number of the
instruction entries in each of the synchronous event pipelines. This
comparison operation can be performed, for example, as described above
for FIG. 7. If the sequence numbers match, then a specific fault code
(or information identifying the specific asynchronous fault) is written
or tagged to the matching instruction entry in the synchronous event
pipeline.

The use of age guarding logic (AGL) 620 in buffer 615 allows faults to
be received and handled out of order without overwriting or losing the
older faults (the older instruction entries having a fault). Therefore,
buffer 605, asynchronous fault buffer 615 (with AGL 620) and comparator
705 may be considered as part of a synchronization circuit 635 to
synchronize asynchronous faults (including out of order faults) with the
instruction entries in the synchronous pipelines.

Although only one buffer 602 is shown, and only one asynchronous fault
buffer 615 (with AGL) is shown, the circuit may include a plurality of
such buffers and the associated or required number of comparators (e.g.,
comparators 705) for synchronizing a plurality of asynchronous fault
sources with a plurality of synchronous event pipelines. Also, mux 625
may be one mux, or may be a plurality of muxes for selecting and
outputting the entries or sequence numbers from a plurality of temporary
registers for comparison with the instruction entries in the synchronous
pipelines. Thus, the use of multiple muxes 625 allows the sequence
numbers for a plurality of threads (or program flows) to be
simultaneously compared to the sequence numbers of the instruction
entries in the synchronous event pipelines.

In the example shown in FIG. 6, the four fault sources (two synchronous
fault sources and two asynchronous fault sources) are merged or combined
into two fault sources by writing or tagging the fault code (or fault
information) for asynchronous faults into the matching or corresponding
instruction entries in the synchronous event pipeline. In general, M
fault sources can be merged or combined into N fault sources, where N is
less than M (and also where the N fault sources are provided in event
pipelines which receive replayed instructions). The two combined fault
sources (each combining synchronous and asynchronous faults) are then
input into AGL 610. AGL 610 performs a similar function to AGL 410 (FIG.
4), but has fewer inputs (e.g., only two inputs in this example) for
comparison. AGL 610, however, could receive and compare any number of
fault source inputs to maintain the oldest faults (fault information)
stored in the fault registers (FR) on a per thread or per program flow
basis.

FIG. 8 is a block diagram illustrating a portion of a fault detection
and recording circuit according to an example embodiment of the present
invention. As noted above, after execution of an instruction, a fault
code 84 will be written back to the ROB 152 (FIGS. 1-2). In FIG. 8, a
synchronous event pipeline 1 includes several stages, including a write
back (WB) stage 805 to write back to the ROB 152. Likewise, an
asynchronous event pipeline includes several stages, including a write
back (WB) stage 810. An output of WB stage 805 is connected to a write
port 1 of ROB 152, while an output of WB stage 810 is connected to a
write port 2 of ROB 152. ROB 152 may include a mux or selection circuit
for selecting fault information (e.g., a fault code) received on one of
the write ports. Although only one synchronous pipeline and one
asynchronous pipeline are shown, it should be understood that any number
of pipelines can be provided. As shown in FIG. 8, if fault
synchronization is not performed, ROB 152 will typically include one
write port for each fault source.

FIG. 9 is a block diagram illustrating a portion of a fault detection
and recording circuit according to another example embodiment of the
present invention. In FIG. 9, faults from the asynchronous event
pipeline 1 are synchronized to corresponding instruction entries in the
synchronous event pipeline 1. The circuit of FIG. 9 operates in a
similar fashion to that shown in FIG. 6. A buffer (such as either buffer
602 or an asynchronous fault buffer 615) is provided between the
asynchronous event pipeline 1 and the synchronous event pipeline 1. As
described above in connection with FIG. 6, if there is a match between
sequence numbers, specific fault information of an asynchronous fault
(such as a specific fault code) is written to a corresponding
instruction entry in the synchronous event pipeline 1. Thus, by
synchronizing the asynchronous faults and synchronous faults, two fault
sources are merged or combined into a single fault source. As a result,
an additional advantage of fault synchronization is that ROB 152
requires fewer write ports. In the example of FIG. 9, only one write
port is required, thereby simplifying ROB 152.

Referring again to FIG. 1, the ROB monitors the instructions and marks
those instructions which execute properly as being ready for retirement
(but will be retired in program order). If an instruction generated a
fault, the ROB 152 receives a general fault/exception vector (e.g., from
the memory execution unit) and stores the fault vector in the ROB entry
for the instruction. If the oldest instruction has a fault or exception,
the ROB 152 then calls the appropriate fault handler (a program for
handling the fault). The fault handler then reads the specific (or
detailed) fault code and linear address stored in the fault register
(alternatively, this information may be first pushed onto a stack prior
to calling the fault handler). As an example, a page fault handler may
retrieve and load the corresponding page table entry if there was a page
not present type of page fault. After the fault is handled, the
processor will continue processing instructions beginning at either the
instruction that caused the fault, or an instruction after the faulting
instruction, depending on the type of exception or fault. Therefore, the
specific fault/exception code and linear address for the oldest fault
may be stored in the fault registers to allow the fault handler to
identify and process (i.e., handle) the fault.
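
The fault synchronization described in connection with FIGS. 6 and 7 (an asynchronous fault buffer with age guarding, followed by a sequence number comparison against entries in the synchronous event pipeline) can be illustrated with a small software model. This is a sketch under stated assumptions rather than the circuit itself: the names AsyncFault, SyncEntry and AsyncFaultBuffer are hypothetical, the thread ID is used to select the temporary register (standing in for the mux), sequence number wraparound is ignored, and freeing the temporary register after a match is an assumption that the description does not specify.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AsyncFault:
    """Simplified instruction entry from an asynchronous event pipeline."""
    thread_id: int
    seq: int          # sequence number; smaller means older
    fault_code: int   # specific fault code (nonzero)


@dataclass
class SyncEntry:
    """Instruction entry passing through a synchronous event pipeline."""
    thread_id: int
    seq: int
    linear_address: int
    fault_code: int = 0   # 0 means no fault tagged yet


class AsyncFaultBuffer:
    """Model of a fault buffer with one temporary register per thread."""

    def __init__(self) -> None:
        self.temp_regs: Dict[int, AsyncFault] = {}

    def record(self, fault: AsyncFault) -> None:
        # Age guarding: keep only the oldest faulting entry per thread,
        # so faults that arrive out of order do not displace older ones.
        held = self.temp_regs.get(fault.thread_id)
        if held is None or fault.seq < held.seq:
            self.temp_regs[fault.thread_id] = fault

    def synchronize(self, entry: SyncEntry) -> bool:
        # Comparator step: on a sequence number match, tag the fault code
        # onto the (replayed) entry in the synchronous pipeline.
        held = self.temp_regs.get(entry.thread_id)
        if held is not None and held.seq == entry.seq:
            entry.fault_code = held.fault_code
            del self.temp_regs[entry.thread_id]  # assumption: free on match
            return True
        return False


buf = AsyncFaultBuffer()
buf.record(AsyncFault(thread_id=1, seq=42, fault_code=14))
buf.record(AsyncFault(thread_id=1, seq=17, fault_code=14))  # older, replaces 42

newer = SyncEntry(thread_id=1, seq=42, linear_address=0x1000)
older = SyncEntry(thread_id=1, seq=17, linear_address=0x2000)
matched_newer = buf.synchronize(newer)
matched_older = buf.synchronize(older)
```

In this model only the replayed entry with sequence number 17 is tagged with the fault code; the entry with sequence number 42 passes through untagged because the buffer retained only the oldest fault for its thread.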

Several embodiments of the present invention are spe- 
cifically illustrated and/or described herein. However, it will 
be appreciated that modifications and variations of the 
present invention are covered by the above teachings and 
within the purview of the appended claims without departing
from the spirit and intended scope of the invention. 

What is claimed is: 

1 . A processor comprising: 

a replay system to replay instructions which have not 
executed properly;

a first processing circuit coupled to the replay system to 
process instructions including any replayed instruc- 
tions; 

a second processing circuit to perform additional processing
on an instruction, the second processing circuit
having an ability to detect one or more faults occurring 
therein; and 

a synchronization circuit coupled between the first processing
circuit and the second processing circuit to
synchronize faults occurring in the second processing 
circuit to matching instruction entries in the first pro- 
cessing circuit. 

2. The processor of claim 1 wherein the first processing
circuit comprises a first event pipeline, and wherein the
second processing circuit comprises a second event pipeline. 

3. A method comprising: 

detecting that an instruction has not executed properly; 
replaying an instruction in a first event pipeline; 
performing additional processing on an instruction in a

second event pipeline; 
detecting a fault for the instruction occurring in the 

second event pipeline; 
matching the instruction having a fault which occurred in

the second event pipeline to an instruction entry for the 

same instruction that is being replayed in the first event 

pipeline; 

writing or tagging fault identification information for the 
fault to the matching instruction entry in the first event
pipeline. 

4. A processor comprising: 

a replay system to replay instructions which have not 
executed properly; 

a synchronous event pipeline coupled to the replay system
to process instructions including any replayed instruc- 
tions; 
an asynchronous event pipeline to perform additional
processing on an instruction, the asynchronous event 
pipeline having an ability to detect one or more asyn- 
chronous faults occurring during the additional instruc- 
tion processing; and 

a synchronization circuit coupled between the synchro- 
nous event pipeline and the asynchronous event pipe- 
line to synchronize an asynchronous fault with a 
replayed instruction in the synchronous event pipeline. 

5. The processor of claim 4 wherein the synchronization 
circuit comprises: 

a buffer coupled to the asynchronous event pipeline for 
storing an instruction entry for an instruction having an 
asynchronous fault, the instruction entry including a 
sequence number and fault identification information 
identifying the asynchronous fault; and 

a comparator coupled to the buffer and the synchronous 
event pipeline, the comparator comparing a sequence 
number of an instruction entry in the synchronous event 
pipeline with the sequence number of the instruction 
entry having the asynchronous fault, and writing the 
fault identification information of the asynchronous 
fault to the instruction entry in the synchronous event 
pipeline if a match is found. 

6. The processor of claim 4 wherein the synchronization
circuit comprises: 

an input buffer coupled to the asynchronous event pipe- 
line for storing an instruction entry for an instruction 
having an asynchronous fault, the instruction entry 
including a sequence number and fault identification 
information identifying the asynchronous fault; 

an asynchronous fault buffer having age guarding logic 
coupled to the input buffer, the asynchronous fault 
buffer storing an instruction entry for an instruction 
having an asynchronous fault in a temporary register; 
and 

a comparator coupled to the asynchronous fault buffer and 
the synchronous event pipeline, the comparator com- 
paring a sequence number of an instruction entry in the 
synchronous event pipeline with a sequence number of 
the instruction entry stored in the temporary register of 
the asynchronous fault buffer, and writing the fault 
identification information of the asynchronous fault to 
the instruction entry in the synchronous event pipeline 
if a match is found. 

7. The processor of claim 6 wherein the asynchronous 
fault buffer comprises: 

age guarding logic coupled to the input buffer; and 
a temporary register coupled to the age guarding logic to 
store an instruction entry for an instruction having an 
asynchronous fault, the age guarding logic maintaining 
an oldest instruction entry in the register as compared 
between an instruction entry having an asynchronous 
fault already present in the temporary register and an 
instruction entry for an instruction having an asynchro- 
nous fault stored in the input buffer. 

8. The processor of claim 6 wherein the asynchronous 
fault buffer comprises: 

age guarding logic coupled to the input buffer; 

one or more temporary registers coupled to the age 
guarding logic to store one or more oldest instruction 
entries for an instruction having an asynchronous fault, 
the age guarding logic maintaining an oldest instruction 
entry in each temporary register as compared between 
an instruction entry having an asynchronous fault 






already present in the temporary register and an instruc- 
tion entry for an instruction having an asynchronous 
fault stored in the input buffer; and 
a mux coupled to the one or more temporary registers for
selectively outputting the instruction entry from a 
selected temporary register to the comparator. 

9. The processor of claim 8 wherein there is one tempo- 
rary register provided for each thread or program flow. 

10. A processor comprising: 

a replay system to replay instructions which have not 
executed properly; 

a memory execution unit coupled to the replay system to 
execute load and store instructions, the memory execu- 
tion unit including: 

a synchronous event pipeline to execute load or store 
instructions, including instructions which have been 
replayed; 

an asynchronous event pipeline coupled to the synchro-
nous event pipeline for performing additional pro- 
cessing on instructions, the asynchronous event pipe- 
line having an ability to detect one or more 
asynchronous faults occurring during the additional 
instruction processing in the asynchronous event 
pipeline; and 

a synchronization circuit coupled between the synchro- 
nous event pipeline and the asynchronous event pipe- 
line to synchronize an asynchronous fault with a 
replayed instruction in the synchronous event pipeline. 

11. The processor of claim 10 wherein said synchronization
circuit comprises a comparator circuit to compare a
sequence number of an instruction entry in the asynchronous 
event pipeline having a fault to sequence numbers of instruc- 
tion entries in the synchronous pipeline, the comparator 
circuit writing fault identification information identifying
the asynchronous fault to the instruction entry in the syn- 
chronous event pipeline if a match is found. 

12. A processor comprising: 

a replay system to replay instructions which have not 
executed properly; 

a plurality of synchronous event pipelines coupled to the 
replay system to process instructions including any 
replayed instructions; 

an asynchronous event pipeline to perform additional 
processing on an instruction, the asynchronous event 
pipeline having an ability to detect one or more asyn- 
chronous faults occurring during the additional instruc- 
tion processing in the asynchronous event pipeline; 

a synchronization circuit coupled between the synchro- 
nous event pipeline and the asynchronous event pipe- 
line to synchronize an asynchronous fault with a 
replayed instruction in the synchronous event pipeline; 

a fault register; and 

age guarding logic coupled between the plurality of
synchronous event pipelines and the fault register, the 



age guarding logic storing fault information in the fault 
register for an oldest instruction entry as compared 
between the instruction entries from the plurality of 
synchronous pipelines and the fault information 
already present in the fault register. 

13. The processor of claim 12 wherein the fault register 
stores fault information for an oldest instruction entry, the
fault information including a specific fault code and an 
address. 
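
The age guarding behavior of claims 12 and 13 can be sketched in the same way. This is an illustrative model under stated assumptions, not the claimed logic: it assumes that a lower sequence number denotes an older instruction and ignores sequence-number wrap-around, and the names FaultInfo, FaultRegister, and age_guard are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FaultInfo:
    """Hypothetical fault record: sequence number, fault code, and address."""
    seq_num: int
    fault_code: int
    address: int


class FaultRegister:
    """Hypothetical register holding fault information for one instruction."""

    def __init__(self) -> None:
        self.info: Optional[FaultInfo] = None


def age_guard(register: FaultRegister,
              candidates: Iterable[Optional[FaultInfo]]) -> Optional[FaultInfo]:
    """Keep only the oldest fault among the candidates and the register.

    Assumption: a smaller seq_num means an older instruction (no wrap-around).
    Each candidate stands for the faulting entry, if any, from one
    synchronous event pipeline; None means that pipeline reported no fault.
    """
    for candidate in candidates:
        if candidate is None:
            continue
        if register.info is None or candidate.seq_num < register.info.seq_num:
            register.info = candidate
    return register.info
```

Under these assumptions, a younger fault arriving later never overwrites the register contents, while an older one replaces them.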

14. A method of processing faults in a processor com- 
prising: 

detecting that an instruction has executed improperly; 
replaying the instruction; 

performing additional processing on the instruction; 
detecting a fault in the additional instruction processing; 
and 

writing fault identification information to an instruction 
entry for the replayed instruction, wherein the performing
additional processing comprises performing a long
latency operation for the instruction in parallel with the 
instruction being replayed. 

15. The method of claim 14 wherein the writing fault 
identification information comprises: 

comparing a sequence number of the instruction entry for 
the replayed instruction with the sequence number of 
the instruction having the fault; and
writing fault identification information to the instruction 
entry for the replayed instruction if there is a match. 

16. A method comprising: 

detecting that an instruction has executed improperly; 
replaying the instruction in a synchronous event pipeline; 
performing additional processing on the instruction in an
asynchronous event pipeline;
detecting an occurrence of an asynchronous fault during 
the additional processing of the instruction in the 
asynchronous event pipeline; 
identifying that the instruction in the asynchronous event
pipeline having the fault matches the replayed instruction
in the synchronous event pipeline; and
writing fault identification information into an instruction
entry for the replayed instruction in the synchronous event
pipeline.

17. The method of claim 16 wherein said replaying 
comprises replaying the instruction in one of a plurality of
synchronous event pipelines. 

18. The method of claim 16 and further comprising 
storing fault identification information for an oldest instruction
in a fault register as compared between the instruction
entry in the synchronous event pipeline having the fault 
identification information and fault identification informa- 
tion already present in the fault register. 